Abstract


This study investigated the relationship between students’ actual performance (accuracy) and 
their subjective judgments of accuracy (confidence) on selected English language proficiency 
tests. The unidimensional and multidimensional IRT Rasch approaches were used to model the 
discrepancy between confidence and accuracy at the item and test level and to assess 
disattenuated strength of association between accuracy and confidence. The analysis results 
indicate a pattern of overconfidence bias (i.e., overestimation of success rate), which was related 
to item difficulty. In addition, the strength of association between accuracy and confidence 
dimension was relatively high: The confidence dimension explained 45% and 52% of the 
variability in the accuracy dimension for the two tests employed in this study. 

Key words: Confidence, Rasch model, accuracy, multidimensional Rasch model 
Introduction

Statistical modeling based on item response theory (IRT) has become one of the most 
widely used psychometric methods in educational testing. The Rasch (Rasch, 1980), two-
parameter, and three-parameter logistic models (Birnbaum, 1968) have become increasingly
popular over the past 20 years, in particular for scaling, equating, item banking, computerized 
adaptive testing, standard setting, and test assembly purposes (see Embretson & Reise, 2000; 
Hambleton & Swaminathan, 1985; Hambleton, Swaminathan, & Rogers, 1991; Wright & Stone, 
1979; Yen & Fitzpatrick, 2006). 

In the field of decision making, measures of confidence and accuracy have been studied 
for economic and weather forecasting. Forecasters’ confidence and their actual performance are 
typically compared. In psychological and educational assessments, a concept of bias scores (i.e.,
differences between accuracy and confidence scores) was introduced (see Stankov & Crawford, 
1996; Stankov & Lee, 2007) and the relationship between confidence and accuracy has been 
examined, for the most part, by employing the calibration curves approach (see Hakstian & 
Kansup, 1975; Keren, 1991; Lichtenstein, Fischhoff, & Phillips, 1982).

The IRT approach has not been used in the studies of confidence and accuracy, but its 
methodological components can be very useful in helping to understand these constructs. IRT 
modeling utilizes a latent construct at the ability level, either at a local point or within a certain 
range, and can provide the means to estimate the discrepancy between confidence and accuracy 
at the item or test levels. The IRT item parameters (e.g., item difficulty) model stimulus 
characteristics, allowing us to further examine the stimulus properties and latent constructs of 
confidence and accuracy. 

For this current investigation, we employed one of the popular IRT models, the Rasch 
model, which exhibits additive conjoint measurement (Perline, Wright, & Wainer, 1979). 
Instruments fitting the Rasch model have sufficiency properties of simple observed statistics for 
model parameters, separability of person and item parameters, and specific objectivity (Fischer,
1995; Hoijtink & Boomsma, 1995; Molenaar, 1995; Rasch, 1966; Wright & Masters, 1982). We
explored in this study four substantive questions, as follows: 

1. Are people’s confidence levels comparable to their accuracy levels? 

2. Do people’s confidence levels change as their accuracy levels change? 

3. Are people’s confidence levels related to the difficulty level of a specific task? 
4. What is the strength of association between people’s accuracy and confidence levels?

A unidimensional Rasch model calibration was used to answer the first three questions, 
and a multidimensional Rasch model calibration was used for the last question. We also 
developed indices to express people’s confidence levels relative to their accuracy levels. Those 
indices were based on the unidimensional Rasch calibration results. Although these four 
questions could be investigated in a much more substantive manner, the focus of this paper is on 
illustrating the application of Rasch modeling and on proposing potentially useful tools in the 
context of researching the nature of the relationship between confidence and accuracy. 

Data 

The data for accuracy were based on the item responses from the Reading and Listening 
sections of an English language proficiency test (N= 820). We used 24 multiple-choice (MC) 
Reading items and 17 MC Listening items for this study. Each MC item had four options. All the 
MC items were dichotomously scored as either 0 (incorrect) or 1 (correct). 

The data for confidence were obtained from the participants’ self-rating on their level of 
confidence about getting each item correct. They were asked to give their subjective rating 
(expressed in percentage terms) immediately after responding to a test item. The question below 
along with the subjective rating range was given to the participants to rate their confidence: 

How Confident Are You That Your Answer Is Correct? 

This approach of collecting confidence levels has been employed by Crawford and 
Stankov (1996), Juslin (1994), Keren (1991), and Stankov and Crawford (1996). In Crawford 
and Stankov’s (1996) and Stankov & Crawford’s (1996) studies, the participants were allowed to 
use any integer number as a percentage to express their confidence levels (for instance, 53%). 
However, they used only numbers rounded to the nearest 10. With a convention of assigning 
20% as the lowest probability value to incorporate a guessing factor in MC questions with five 
options, a subjective probability range of 20% to 100% was used in this study. 

Method 

In this study, we have defined accuracy as the dimension underlying the performance on 
the English language proficiency test, and confidence as the dimension producing the subjective 
probability (i.e., a person’s subjective judgment on the correctness of each item, expressed as a 
percentage). The Rasch model was fitted to the dichotomously scored item responses. The
general form of the Rasch model for binary item responses is:

$$P(X_i = x_i \mid \theta) = \frac{\exp[x_i(\theta - \delta_i)]}{1 + \exp(\theta - \delta_i)}, \qquad (1)$$

where $X_i$ is an indicator variable for the $i$th item ($1 \le i \le I$), $x_i$ is its realization (1 for correct and 0
for incorrect), $\theta$ is a person’s latent score (latent ability), and $\delta_i$ is the $i$th item difficulty.
$P(X_i = x_i \mid \theta)$ is the probability of the $i$th item response $x_i$ for a person with latent score $\theta$. From this
Rasch calibration, we obtained the objective probability for answering each item correctly for
each person and his or her latent accuracy score.
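
As an illustration of Equation 1, the following minimal Python sketch evaluates the Rasch response probability for a given ability and item difficulty. The function name `rasch_prob` and the example values are illustrative assumptions; the study itself used ConQuest for calibration.

```python
import math


def rasch_prob(theta, delta, x=1):
    """Probability of response x (1 = correct, 0 = incorrect) under Equation 1.

    theta -- person latent ability
    delta -- item difficulty
    """
    z = theta - delta
    return math.exp(x * z) / (1.0 + math.exp(z))


# When ability equals difficulty, a correct response has probability 0.5.
print(rasch_prob(0.0, 0.0))
# A person of average ability facing an easy item (hypothetical values).
print(rasch_prob(0.0, -2.0))
```

The two response probabilities for any item sum to 1, which is a quick sanity check on the formula.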

For research questions 1 and 2 (Are people’s confidence levels comparable to their 
accuracy levels? and Do people’s confidence levels change as their accuracy levels change?), the 
objective probability was compared with the subjective probability for each item and at the total 
test level. The probability comparisons and their interpretations are: 

$$\text{Objective probability} = P_i = E(X_i = 1 \mid \theta_{ac}) = P(X_i = 1 \mid \theta_{ac}), \qquad (2)$$

$$\text{Subjective probability} = P_i^* = E(\pi_i \mid \theta_{ac}) = \frac{\sum_{n} \pi_{ni}\, I[\theta_{ac(n)} = \theta_{ac}]}{\sum_{n} I[\theta_{ac(n)} = \theta_{ac}]}, \qquad (3)$$

$$\text{Overconfidence}: P_i^* - P_i > 0, \text{ and} \qquad (4)$$

$$\text{Underconfidence}: P_i^* - P_i < 0, \qquad (5)$$

where $P_i$ is defined through Equation 1, $\theta_{ac}$ is a latent accuracy score, $\pi_{ni}$ is the $n$th person’s
subjective probability on the $i$th item ($1 \le n \le N$), $I[\cdot]$ is an indicator function (1 if $\theta_{ac(n)} = \theta_{ac}$; 0
otherwise), and $\theta_{ac(n)}$ is the $n$th person’s estimate on the latent accuracy dimension. Equation 3 is a
regression of the subjective rating onto accuracy and is estimated by averaging the subjective
rating for all participants with the same ability on the accuracy dimension. Objective probability 
and person latent scores are obtained through the unidimensional Rasch model calibration for the 
accuracy data. Subjective probability is estimated by Equation 3. A summary statistic for the 
discrepancy between confidence and accuracy at the item level can be expressed as: 
$$\text{Item-level discrepancy index } (\mathrm{IDI}_i) = \sum_{\theta_{ac}} w(\theta_{ac})\,(P_i^* - P_i), \qquad (6)$$

where $w(\theta_{ac})$ is the relative frequency of the person latent score distribution at $\theta_{ac}$, that is,

$$w(\theta_{ac}) = \frac{N_\theta}{N}, \qquad (7)$$

with $N$ = total sample size and $N_\theta$ = number of persons at $\theta_{ac}$,

Overconfidence: $\mathrm{IDI}_i > 0$ for the $i$th item, and (8)

Underconfidence: $\mathrm{IDI}_i < 0$ for the $i$th item. (9)
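
The item-level computation in Equations 2, 3, 6, and 7 can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function names and the toy data are hypothetical, and persons are grouped by identical ability estimates on the assumption that everyone with the same number-correct score receives the same estimate.

```python
import math
from collections import defaultdict


def rasch_prob(theta, delta):
    """Objective probability of a correct response (Equations 1 and 2)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_discrepancy_index(thetas, ratings, delta):
    """IDI for one item: sum over ability levels of w(theta) * (P* - P).

    thetas  -- latent accuracy estimate for each person
    ratings -- each person's subjective probability (0 to 1) for this item
    delta   -- Rasch difficulty of this item
    """
    groups = defaultdict(list)
    for theta, rating in zip(thetas, ratings):
        groups[theta].append(rating)

    n_total = len(thetas)
    idi = 0.0
    for theta, group in groups.items():
        weight = len(group) / n_total           # Equation 7: relative frequency
        p_subjective = sum(group) / len(group)  # Equation 3: mean rating at theta
        p_objective = rasch_prob(theta, delta)  # Equation 2
        idi += weight * (p_subjective - p_objective)
    return idi


# Toy data (hypothetical): three persons, two distinct ability levels.
toy_thetas = [0.0, 0.0, 1.0]
toy_ratings = [0.8, 0.6, 0.9]
toy_idi = item_discrepancy_index(toy_thetas, toy_ratings, delta=0.0)
print(toy_idi)  # positive, so this toy item shows overconfidence
```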

In this study, we classify the size of IDI by the differential item functioning (DIF) effect size on 
the probability scale suggested by Dorans and Holland (1993). In their classification, sizes |0.05| 
and |0.10| are used as thresholds for negligible, medium, and large DIF. With this rule applied to 
the discrepancy between accuracy and confidence, the following criteria for overconfidence can 
be established: 

Large discrepancy = $\mathrm{IDI}_i > 0.10$, (10)

Medium discrepancy = $0.05 < \mathrm{IDI}_i < 0.10$, and (11)

Small or negligible discrepancy = $0 < \mathrm{IDI}_i < 0.05$. (12)

For underconfidence, the thresholds -0.05 and -0.10 are used for classification. These IDI cutoff 
points can be seen as arbitrary choices, but no such theoretical cutoff points had been established 
at the time that this paper was written. We introduce these rules as a start, and their usefulness 
can be verified in future studies. 
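
A small helper makes the classification rule concrete. This is a sketch under stated assumptions: the function name `classify_idi` is hypothetical, and because the criteria above use strict inequalities, the treatment of values falling exactly on 0.05 or 0.10 (assigned here to the lower class) is our assumption rather than part of the rule.

```python
def classify_idi(idi, medium=0.05, large=0.10):
    """Return (size, direction) for an item-level discrepancy index.

    Thresholds follow the 0.05 and 0.10 cutoffs described in the text.
    Values exactly on a threshold fall into the lower class (an assumption).
    """
    magnitude = abs(idi)
    if magnitude > large:
        size = "large"
    elif magnitude > medium:
        size = "medium"
    else:
        size = "small or negligible"

    if idi > 0:
        direction = "overconfidence"
    elif idi < 0:
        direction = "underconfidence"
    else:
        direction = "none"
    return size, direction


# Hypothetical IDI values chosen to fall in each class.
for value in (0.207, 0.066, -0.003):
    print(value, classify_idi(value))
```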

An index for the discrepancy between confidence and accuracy at the test level at a given
$\theta_{ac}$ is defined as follows:

$$\text{Objective expected score} = \sum_i P_i, \qquad (13)$$

$$\text{Subjective expected score} = \sum_i P_i^*, \qquad (14)$$

$$\text{Overconfidence at the total test score level}: \sum_i P_i^* - \sum_i P_i > 0, \text{ and} \qquad (15)$$

$$\text{Underconfidence at the total test score level}: \sum_i P_i^* - \sum_i P_i < 0. \qquad (16)$$

A summary statistic for the test-level discrepancy index (TDI) over $\theta_{ac}$ is defined as:

$$\mathrm{TDI} = \sum_{\theta_{ac}} w(\theta_{ac}) \left( \sum_i P_i^* - \sum_i P_i \right), \qquad (17)$$

Overconfidence on average: TDI > 0, and (18)

Underconfidence on average: TDI < 0. (19)

A more flexible version of the TDI may be expressed as:

$$\text{Discrepancy percentage (DP)} = \frac{\mathrm{TDI}}{\text{test length}} \times 100, \qquad (20)$$

where the test length is the number of items used in the calculation. The DP is a signed measure 
of underconfidence or overconfidence standardized by the test length, which is an adjustment for 
comparisons between tests with different test lengths. 
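
Because the IDI and the TDI use the same weights, the TDI equals the sum of the item-level IDIs, and the DP follows by dividing by the test length. The sketch below checks this arithmetic with the 17 Listening IDI values reported in Table 1. The variable names are illustrative, and the inputs are the rounded values from the table, so small rounding differences are possible in general.

```python
# Listening IDI values as reported (rounded) in Table 1.
listening_idi = [
    0.119, -0.002, 0.004, 0.064, 0.023, -0.012, -0.003, 0.065, 0.141,
    0.111, -0.045, -0.027, 0.066, 0.207, 0.112, -0.004, 0.001,
]

# TDI as the sum of item-level discrepancies (same weights as the IDI).
listening_tdi = sum(listening_idi)

# DP (Equation 20): TDI standardized by test length, as a percentage.
listening_dp = 100.0 * listening_tdi / len(listening_idi)

print(round(listening_tdi, 3), round(listening_dp, 1))
```

For these inputs the result agrees with the reported Listening TDI of 0.820 and DP of 4.8.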

For the person latent score estimation, the expected a posteriori (EAP; Bock & Mislevy,
1982) estimator was used. The EAP is the mean of the posterior distribution of a person’s
ability given the person’s item response string $X = (x_1, x_2, \ldots, x_I)'$ and the item
parameters $\{\delta_i\}$. Its mathematical expression is:

$$E(\theta \mid X, \{\delta_i\}) = \frac{\int \theta\, P(X \mid \theta, \{\delta_i\})\, g(\theta)\, d\theta}{\int P(X \mid \theta, \{\delta_i\})\, g(\theta)\, d\theta}. \qquad (21)$$

The prior distribution for $\theta$, $g(\theta)$, was assumed to follow a univariate normal distribution for the
unidimensional Rasch calibrations and a bivariate normal distribution for the multidimensional
Rasch calibrations.
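
Equation 21 can be approximated numerically by replacing the integrals with sums over a grid of quadrature points. The following sketch does this for the unidimensional case. It is not the ConQuest algorithm: the evenly spaced grid, the number of points, the standard normal prior defaults, and the function names are all illustrative assumptions.

```python
import math


def rasch_prob(theta, delta):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def eap_estimate(responses, deltas, mu=0.0, sigma=1.0, n_points=61, span=6.0):
    """Approximate the EAP (Equation 21) on an evenly spaced grid.

    responses -- list of 0/1 item scores for one person
    deltas    -- item difficulties, in the same order as responses
    mu, sigma -- mean and SD of the assumed normal prior g(theta)
    """
    numerator = 0.0
    denominator = 0.0
    for k in range(n_points):
        theta = mu + sigma * (-span + 2.0 * span * k / (n_points - 1))
        # Unnormalized normal prior; the constant cancels in the ratio.
        prior = math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        likelihood = 1.0
        for x, delta in zip(responses, deltas):
            p = rasch_prob(theta, delta)
            likelihood *= p if x == 1 else 1.0 - p
        numerator += theta * likelihood * prior
        denominator += likelihood * prior
    return numerator / denominator


# Hypothetical three-item test with difficulties -1, 0, and 1.
print(eap_estimate([1, 1, 0], [-1.0, 0.0, 1.0]))
```

Because the prior pulls estimates toward its mean, the EAP stays finite even for all-correct or all-incorrect response strings.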

To answer our third research question (Are people’s confidence levels related to the 
difficulty level of a specific task?), the item difficulties obtained from the unidimensional Rasch 
calibrations were compared to the item-level discrepancy index. In order to answer the fourth 
research question (What is the strength of association between people’s accuracy and confidence
levels?), the variance-covariance matrix for the accuracy and confidence dimensions was
estimated by multidimensional Rasch modeling. Because the participants provided their
confidence ratings in a probabilistic response metric, we simulated the dichotomous item
responses for the confidence dimension. Assuming that the confidence rating is the subjective
probability defined by the item response function, $\mathrm{IRF}(\theta_{ac}, \theta_{co})$, where $\theta_{co}$ is a latent confidence
dimension and $(\theta_{ac}, \theta_{co})$ follows a bivariate distribution with mean vector $\mu$ and variance-covariance
matrix $\Sigma$, and assuming strict monotonicity of the IRF with regard to the latent dimension for
confidence $\theta_{co}$, the subjective probability rating can be expressed as:

$$\text{Subjective probability rating} = \mathrm{IRF}_{ni}(\theta_{ac}, \theta_{co}) = \mathrm{IRF}_{ni}(\theta_{co}) = P(Y_{ni} = 1 \mid \theta_{co}), \qquad (22)$$

$$y_{ni} = \begin{cases} 1 & \text{if } P(Y_{ni} = 1 \mid \theta_{co}) > u,\ u \sim \text{uniform}(0, 1), \\ 0 & \text{otherwise}, \end{cases} \qquad (23)$$

where $Y_{ni}$ is the $n$th person’s $i$th item response from the confidence dimension $\theta_{co}$, and $u$ is a random
draw from the standard uniform distribution. The equality between $\mathrm{IRF}_{ni}(\theta_{ac}, \theta_{co})$ and
$\mathrm{IRF}_{ni}(\theta_{co})$ in Equation 22 is due to the simple structure (i.e., an item loads onto a single
dimension) in the model data fitting.
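
The simulation step described above, in which each confidence rating is converted to a 0/1 response by comparison with a uniform random draw, can be sketched as follows. The function name, the seeding argument, and the example ratings are hypothetical and serve only to illustrate the rule.

```python
import random


def simulate_confidence_responses(ratings, seed=None):
    """Convert subjective probabilities into simulated 0/1 responses.

    ratings -- list of persons, each a list of ratings on the 0-1 scale
    seed    -- optional seed so that a simulation can be reproduced
    A response is 1 when the rating exceeds a uniform(0, 1) draw, else 0.
    """
    rng = random.Random(seed)
    return [
        [1 if rating > rng.random() else 0 for rating in person]
        for person in ratings
    ]


# Two hypothetical persons rating three items each.
example_ratings = [[0.9, 0.6, 1.0], [0.5, 0.2, 0.8]]
print(simulate_confidence_responses(example_ratings, seed=1))
```

Under this rule a rating of 1.0 always yields a response of 1, and over many draws the proportion of 1s for a given rating approaches the rating itself.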

The multidimensional Rasch model used in this study has the following form: 


$$P(X_{id} = x_{id} \mid \theta_d) = \frac{\exp[x_{id}(\theta_d - \delta_{id})]}{1 + \exp(\theta_d - \delta_{id})}, \qquad (24)$$

where $d$ is an indicator (for either accuracy or confidence) of the dimension on which the $i$th item
loads, and $\delta_{id}$ is the $i$th item difficulty on dimension $d$. The variance-covariance matrix was
modeled through $(\theta_{ac}, \theta_{co}) \sim \mathrm{MVN}(\mu, \Sigma)$, where MVN stands for multivariate normal
distribution, $\mu$ is a 2 × 1 mean vector, and $\Sigma$ is a 2 × 2 variance-covariance matrix. The variances
and correlations (or covariances) are model parameters and are estimated directly from the item 
response matrix. The program ConQuest (Wu, Adams, & Wilson, 1998) was used for both 
unidimensional and multidimensional Rasch model calibrations. ConQuest is capable of 
estimating all possible model variants based on the multidimensional random coefficient 
multinomial logit model (MRCMLM; Adams, Wilson, & Wang, 1997). Thus, this program can 
perform the fitting of a variety of Rasch family models, such as the rating scale model (Andrich, 
1978), the partial-credit model (Masters, 1982), the facet model (Linacre, 1989), and the linear
logistic test model (LLTM; Fischer, 1989). ConQuest uses marginal maximum likelihood (MML)
estimation with the expectation-maximization (EM) algorithm (Bock & Aitkin, 1981; Dempster, 
Laird, & Rubin, 1977). 

The investigation of item misfit for the Rasch model was performed through the weighted 
mean squared (WMNSQ) fit statistic, using ConQuest. The WMNSQ fit statistic is based on the 
difference between the observed and predicted responses. It approaches 1 when the model fits 
the data well (Wright & Masters, 1982; Wu, 1997). Adams and Khoo (1996) and Wilson (2005) 
suggested the interval of [0.75, 1.33] of the WMNSQ for the item fit criterion. All items used for 
this study were within this interval. The WMNSQ fit statistics are provided in Appendix A, 
along with the item difficulties and their standard errors. 

Results 

We organized the results of this study by the four research questions presented in the 
introduction to this paper. 

1. Are People’s Confidence Levels Comparable to Their Accuracy Levels? 

The results of the calculations for IDI, TDI, DP, and the classifications of the IDI values 
are shown in Table 1. The Listening section had nine small IDIs, three medium IDIs, and five 
large IDIs. The IDI values for items with medium or large IDIs were all positive and ranged from 
0.065 to 0.207; these positive IDIs indicated overconfidence expressed on those items. Within 
the Listening section, 47% [100*(8/17)] of the items showed evidence of overconfidence. At the 
test level, the TDI was 0.82, indicating that the participants overpredicted their Listening section 
score by 0.82 score points on average. 

The Reading section had 10 small IDIs, 4 medium IDIs, and 10 large IDIs. The IDI 
values for items with medium or large IDIs were all positive and ranged from 0.061 to 0.633; 
these positive IDIs indicated overconfidence expressed on those items. Within the Reading 
section, 58% [100*(14/24)] of the items showed evidence of overconfidence. At the test level, 
the TDI was 2.89, indicating that the participants overpredicted their Reading section score by 
2.89 score points on average. 
Table 1

Item Discrepancy Index (IDI) and Test-Level Discrepancy Index (TDI) for Listening and Reading

               Listening                 Reading
Item        IDI       IDI class       IDI       IDI class
Item 1       0.119    L                0.048
Item 2      -0.002                     0.001
Item 3       0.004                     0.149    L
Item 4       0.064    M                0.004
Item 5       0.023                     0.042
Item 6      -0.012                     0.179    L
Item 7      -0.003                     0.016
Item 8       0.065    M                0.299    L
Item 9       0.141    L                0.046
Item 10      0.111    L                0.078    M
Item 11     -0.045                     0.099    M
Item 12     -0.027                     0.010
Item 13      0.066    M                0.109    L
Item 14      0.207    L                0.162    L
Item 15      0.112    L                0.141    L
Item 16     -0.004                     0.010
Item 17      0.001                     0.089    M
Item 18                                0.290    L
Item 19                                0.061    M
Item 20                                0.257    L
Item 21                                0.004
Item 22                                0.047
Item 23                                0.118    L
Item 24                                0.633    L
TDI          0.820                     2.894
DP           4.8                      12.1

Note. DP = discrepancy percentage, M = medium size IDI, L = large size IDI.

DPs were 4.8 for the Listening section and 12.1 for the Reading section, which reveals 
that the participants showed more pronounced overconfidence in the Reading section than in the 
Listening section. 
2. Do People’s Confidence Levels Change as Their Accuracy Levels Change?

Figures 1 and 2 show a few example items from the Listening and Reading sections,
respectively. In these plots, the values of $P_i$ (objective probabilities for accuracy shown in
Equation 2) and $P_i^*$ (subjective probabilities for confidence shown in Equation 3) were plotted to
see how people’s confidence may change according to their accuracy at the item level. Vertical
dotted lines in Figures 1 and 2 represent the IRT Rasch item difficulties. A higher difficulty
value represents a more difficult item. Figure 1 (Listening) shows three example items, each of
which represents an item with a different size of IDI: Item 7 with a small IDI (IDI = -0.003),
Item 13 with a medium IDI (IDI = 0.066), and Item 14 with a large IDI (IDI = 0.207). The rest of
the items with medium or large IDIs showed the same pattern as Items 13 and 14. In general, the
tendency to show overconfidence decreased as ability on the accuracy dimension increased; that
is, the participants with higher accuracy scores tended to predict their actual performance more
accurately than those with lower accuracy scores did.


[Figure 1 panels: Item 7, Item 13, Item 14; horizontal axis: Theta]

Figure 1. Item plots for listening.

Note. Circles represent subjective probabilities (confidence); solid lines represent objective
probabilities (accuracy); and the vertical dotted lines represent item difficulties.
[Figure 2 panels: Item 4, Item 19, Item 20; horizontal axis: Theta]

Figure 2. Item plots for reading.

Note. Circles represent subjective probabilities (confidence); solid lines represent objective
probabilities (accuracy); and the vertical dotted lines represent item difficulties.

Figure 2 shows the three example items from the Reading section. Item 4 shows a small 
IDI (IDI = 0.004), Item 19 has a medium IDI (IDI = 0.061), and Item 20 has a large IDI (IDI = 
0.257). The rest of the Reading section items with medium or large IDIs showed overconfidence 
patterns that were very similar to those seen in Items 19 and 20. In general, as people’s ability on 
the accuracy dimension increased, the confidence ratings (expressed in subjective probability) 
came closer to their objective performance (shown by objective probability). 

Figure 3 illustrates the change in overconfidence as a function of ability on the accuracy
dimension at the test level. The plots were constructed using the objective expected score
$\sum_i P_i$ (Equation 13) and the subjective expected score $\sum_i P_i^*$ (Equation 14) for a given $\theta_{ac}$. Overall, both the
Listening and Reading sections showed pronounced patterns of overconfidence across all ability
levels on the accuracy dimension. However, the participants’ subjective predictions (i.e.,
confidence) became closer to their actual performance level as their ability level increased. One
may claim that better prediction of participants’ performance is associated with improved
precision of their predictions (i.e., reduction of variability in their confidence rating). Others may
argue that the better prediction of higher-ability people could be due to the ceiling effect in the
confidence rating scale that is bounded between 0 and 1 in probability.


[Figure 3 panels: Listening, Reading; horizontal axis: Accuracy Theta; vertical axis: Expected Total Score; series: Confidence Total, Accuracy Total]

Figure 3. Confidence and accuracy at the test score level.


Figure 4 shows the average standard deviation (SD) for the subjective probability
conditional on $\theta_{ac}$. The average SD for the subjective probability decreased as the ability on the
accuracy dimension increased.
[Figure 4 panel: Listening]

Figure 4. Average standard deviation plot for subjective probability.

The plots in Figures 1 through 4 were constructed using $P_i^*$ (subjective probabilities on
the $i$th item) and $\sum_i P_i^*$ (subjective expected test scores). Because of their nonparametric nature,
different sample sizes at each level of $\theta_{ac}$ could affect the stability of $P_i^*$ and $\sum_i P_i^*$. To examine
the degree of heterogeneity in the stability for $P_i^*$ and $\sum_i P_i^*$, the frequencies at each $\theta_{ac}$ were
calculated, which are shown in Table 2. Both the Listening and Reading sections were easy for
this group of participants, and average item difficulties were -2.09 and -1.70. Thus, higher
frequencies are observed at the higher end of the accuracy scale, resulting in more stability for
SD, $P_i^*$, and $\sum_i P_i^*$ at the higher-ability range on the accuracy dimension. However, the ceiling
effect in the confidence rating scale used in the present study, as addressed before, could be
again contributing to this reduced variability at the higher end of ability.

Table 2

Frequency Distributions for Ability Estimate on Accuracy

                 Listening                                   Reading
Number           Ability estimate                Number           Ability estimate
correct score    on accuracy       Frequency     correct score    on accuracy       Frequency
 2               -3.47               1            4               -3.22               2
 3               -3.15               7            5               -2.97               4
 4               -2.85               4            7               -2.53               7
 5               -2.58               9            8               -2.31              11
 6               -2.32               5            9               -2.10              16
 7               -2.06               2           10               -1.91              13
 8               -1.81              16           11               -1.72              22
 9               -1.57              18           12               -1.53              24
10               -1.32              25           13               -1.33              35
11               -1.07              38           14               -1.13              34
12               -0.80              50           15               -0.93              41
13               -0.51              71           16               -0.73              36
14               -0.19             127           17               -0.51              42
15                0.21             143           18               -0.27              59
16                0.68             186           19               -0.01              69
17                1.26             118           20                0.29              86
                                                 21                0.64              91
                                                 22                1.04             103
                                                 23                1.51              99
                                                 24                2.05              26
Total N                            820           Total N                            820


3. Are People’s Confidence Levels Related to the Difficulty Level of a Specific Task? 

The Rasch item difficulty estimated from the unidimensional Rasch calibration for the 
accuracy dimension was compared with the item-level confidence measure (IDI). The correlation 
between the IDI and the item difficulties was 0.92 for the Listening section and 0.93 for the 
Reading section, showing a strong relationship between the item difficulty and item-level 
overconfidence. However, these relationships were rather curvilinear: As the item difficulty 
increased, the overconfidence increased in a nonlinear fashion. To find a trend in the nonlinear 
relationship, simple linear and polynomial regression models were fitted, and the model 
comparisons were conducted using a general F-test in a forward selection manner with a significance level of
0.05 (see, for example, Sen & Srivastava, 1990, for F-test details). A lower-degree regression with
respect to the predictor (difficulty) was compared to a one-degree higher polynomial regression.
The model comparison continued consecutively from the simple linear model until the statistical
test showed nonsignificance. We found that the second-degree polynomial model provided the best
trend-line function describing the relationship between the item difficulty and the IDI for both the
Listening and Reading sections. Figure 5 shows the plots where the second-degree polynomial line
was overlaid with the coefficient of determination ($R^2$) and the regression equations estimated.
$R^2$ increased by 10% and 11% when the linear relationship was replaced by the
second-degree polynomial. These $R^2$ changes were statistically significant. The third-degree
polynomial did not provide a significant improvement over the second-degree polynomial.


[Figure 5 panels, each plotting IDI against Item Difficulty:
Listening: Cor = 0.92; E(y) = 0.0393x^2 + 0.2418x + 0.3542; R^2 = 0.9467
Reading: Cor = 0.93; E(y) = 0.0192x^2 + 0.1531x + 0.2974; R^2 = 0.9579]

Figure 5. Relationship between item difficulty and item discrepancy index (IDI).
4. What Is the Strength of Association Between People’s Accuracy and Confidence Levels?

Table 3 shows the variance, covariance, and correlation between accuracy and confidence 
scores estimated from the multidimensional Rasch analysis. The correlations between accuracy 
and confidence were 0.67 and 0.72 for the Listening and Reading sections, respectively. 
Confidence alone explained 45% of the variation in the accuracy dimension for the Listening 
section and 52% for the Reading section. 

Table 3

Latent Correlation Between Confidence and Accuracy Dimensions

                    Listening                     Reading
              Accuracy    Confidence       Accuracy    Confidence
Accuracy       1.286       0.805            1.628       1.019
Confidence     0.670       1.124            0.720       1.231

Note. Values below the diagonal are correlations, and values above are covariances.
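
The correlations and the percentages of explained variability can be recovered from the variances and covariances in Table 3. The sketch below repeats that arithmetic; the function name `latent_correlation` is illustrative, and the inputs are the rounded values reported in the table.

```python
import math


def latent_correlation(var_accuracy, var_confidence, covariance):
    """Correlation implied by two variances and their covariance."""
    return covariance / math.sqrt(var_accuracy * var_confidence)


# Variances (diagonal) and covariances (above diagonal) from Table 3.
r_listening = latent_correlation(1.286, 1.124, 0.805)
r_reading = latent_correlation(1.628, 1.231, 1.019)

# Squared correlation: share of variability in one dimension explained by the other.
explained_listening = 100.0 * r_listening ** 2
explained_reading = 100.0 * r_reading ** 2

print(round(r_listening, 2), round(explained_listening))
print(round(r_reading, 2), round(explained_reading))
```

Rounded to two decimals, the computed correlations agree with the reported 0.67 and 0.72, and the squared correlations with the reported 45% and 52%.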


Discussion 

The IRT Rasch model-based approach allowed us to examine under/overconfidence at 
any desired level of ability at the item or test levels. Investigations could be conducted at certain 
ability points or intervals, which can be defined as a local level of confidence, or across all 
ability ranges. Furthermore, confidence can be examined at an individual item level, over a 
subset of items, or across the overall test. The local under/overconfidence indices at the item and 
test levels (Equations 6, 17, and 20) can also be calculated with a modification for an interval of 
interest in the latent dimension $\theta$. Their mathematical expressions are shown in Appendix B. The
graphical representations, such as Figures 2 and 3, illustrate the gaps between accuracy and 
confidence, making it easy to diagnose magnitudes of confidence at points on the accuracy 
dimension and to examine the pattern of the relationship between confidence and accuracy. 

The proposed IDI, TDI, DP, LIDI, LTDI, and LDP are robust against outliers in that they 
are weighted summary indices. These indices are calculated not only by the gap between 
confidence and accuracy, but also by their relative frequency distributions. For example, a 
contribution to the calculation of these indices for a large gap with a fairly small relative 
frequency would be minimized by the weighting factor, relative frequency. Summary indices 
more sensitive to outliers can be obtained by omitting the term $w(\theta_{ac})$ in Equations 6, 17, 20, B1,
B2, and B3, which would produce unweighted summary measures. The comparison between the
unweighted and weighted indices would indicate the extent to which a relatively small number of
people appear to have large gaps.

The weight $w(\theta_{ac})$ in Equations 6, 17, 20, B1, B2, and B3 makes use of the individual $\theta_{ac}$
distribution. Another choice of weight is the population weight. In the MML
estimation of the Rasch models, $\theta_{ac}$ was assumed to follow a normal distribution in this study.
Thus, the population weight $w(\theta_{ac}^*)$ is

$$\text{Population weight } w(\theta_{ac}^*) = \frac{1}{\sigma}\, \phi\!\left(\frac{\theta_{ac}^* - \mu}{\sigma}\right), \qquad (25)$$

where $\phi(\cdot)$ is the standard normal density function, $\mu$ is the mean of the normal distribution, $\sigma$ is
the standard deviation of the normal distribution, and $\theta_{ac}^*$ is the quadrature point used in the
MML estimation. The use of a population weight may be more attractive when the population
weights are estimated as model parameters in a semi-parametric IRT MML approach (see, for
example, Mislevy, 1984, for nonparametric modeling of the population distribution).
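
A population weight of this kind can be evaluated at a set of quadrature points as sketched below. The function name and the grid are hypothetical, and rescaling the densities so that the weights sum to 1 (so they can stand in for relative frequencies) is our assumption; it is not stated in the text.

```python
import math


def population_weights(points, mu=0.0, sigma=1.0):
    """Normal-density weights at quadrature points, rescaled to sum to 1.

    points    -- quadrature points on the latent accuracy scale
    mu, sigma -- mean and SD of the assumed normal population distribution
    """
    densities = [
        math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        for t in points
    ]
    total = sum(densities)
    return [d / total for d in densities]  # rescaling is an assumption


# Hypothetical grid of five quadrature points.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
weights = population_weights(grid)
print(weights)
```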

Although the IDI, TDI, and DP are useful summary measures, they have limitations. In 
particular, when confidence and accuracy are crossing in such a way that positive and negative 
gaps are essentially offsetting in the calculation of the difference between the two, the graphical 
representation should be used together with the IDI, TDI and DP. We can also quantify such a 
crossing trend by modifying Equations 6, 17, B1, and B2. Instead of using the difference between
accuracy and confidence, the absolute difference (or the squared difference) can be used. For
example, $(P_i^* - P_i)$ can be replaced with $|P_i^* - P_i|$ (or $(P_i^* - P_i)^2$) and $\sum_i P_i^* - \sum_i P_i$ with
$\left|\sum_i P_i^* - \sum_i P_i\right|$ (or $\left(\sum_i P_i^* - \sum_i P_i\right)^2$) prior to the summation over $\theta_{ac}$, when the absolute difference (or the squared difference)
is employed. A crossing pattern between confidence and accuracy is likely to be observed when
there is a small IDI or TDI with large values of this modified version of IDI or TDI. Note that this
modified version of IDI or TDI cannot provide directional information (i.e., overconfidence or
underconfidence) as our initially proposed IDI, TDI, and DP do. Again, graphical displays with these
summary indices can show local under/overconfidence at the item or test levels.
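
The offsetting effect described above can be seen in a small numerical sketch that computes the signed index and its absolute-difference counterpart from the same weights and gaps. The function name and the toy numbers are hypothetical.

```python
def signed_and_absolute_index(weights, gaps):
    """Weighted signed and absolute discrepancy summaries.

    weights -- relative frequencies w(theta) at each ability level
    gaps    -- confidence minus accuracy (P* - P) at each ability level
    """
    signed = sum(w * g for w, g in zip(weights, gaps))
    absolute = sum(w * abs(g) for w, g in zip(weights, gaps))
    return signed, absolute


# Toy crossing pattern: overconfident at low ability, underconfident at high ability.
signed, absolute = signed_and_absolute_index([0.5, 0.5], [0.2, -0.2])
print(signed, absolute)
```

Here the signed index is zero because the gaps cancel, while the absolute version is clearly positive, which is the signature of a crossing pattern.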


Summary 

This study employed the IRT analyses for the Reading and Listening sections of an 
English language proficiency test. One of the main purposes in this paper was to show the 
usefulness of IRT modeling in studying the relationship between confidence and accuracy. In our 
analyses, we treated confidence and accuracy as two different constructs. Based on the 
unidimensional Rasch model calibration, we proposed effect-size indices at the item and test 
levels that are potentially useful in measuring underconfidence or overconfidence. In addition, 
the use of the multidimensional Rasch modeling allowed us to investigate the strength of 
association between these two latent dimensions. It should be noted that the procedures and 
proposed over/underconfidence indices (e.g., IDI, TDI, and DP) in this study are applicable in 
general to whatever IRT models are adopted for analysis. 

The IRT approach was also useful in answering our substantive research 
questions. We showed that the participants were overconfident in general: They 
tended to overestimate their performance on both the Listening and Reading sections of the 
English language proficiency test. Their overconfidence was negatively related to their ability 
levels and positively related to item difficulty. The strength of association between the 
accuracy and confidence latent dimensions was moderate: The confidence dimension alone 
explained 45% (Listening) and 52% (Reading) of the variability in the accuracy dimension. 
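
As a rough reading of these percentages, and assuming that each one is the squared correlation between the two latent dimensions with a positive sign (an assumption; only the percentages are reported here), the implied latent correlations would be approximately:

\[
r_{\text{Listening}} = \sqrt{0.45} \approx 0.67, \qquad r_{\text{Reading}} = \sqrt{0.52} \approx 0.72 .
\]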

There are a couple of cautionary statements that we want to make. One is that the 
overconfidence pattern observed in this study may not be salient in tests of other cognitive subjects 
(Juslin & Olsson, 1997). In particular, the pattern of overconfidence could be different under a 
different testing environment, such as computerized adaptive testing, where item difficulty 
is matched to the examinee's ability so that the ability level can be estimated efficiently. 4 Another 
caution is that the narrowing of the gap between accuracy and confidence at the 
high end of ability may be due to the subjective rating being given in probability units with a 
maximum boundary of unity. If we were to gather the subjective confidence ratings on a scale 
with no maximum boundary, we might be able to examine whether the better prediction by 
higher-ability participants has to do with the form of the subjective rating scale. 





Although this study proposed potentially useful measures for under/overconfidence, these 
measures lack statistical significance testing procedures, which is one of the limitations of the 
current study. In addition, the thresholds for under/overconfidence classification at the test level 
were not developed in the present study. A simple classification for TDI can be made by the 
following rule: 


Small or negligible: $|\mathrm{TDI}| < \text{test length} \times v_M$, 

Medium: $\text{test length} \times v_M < |\mathrm{TDI}| < \text{test length} \times v_L$, and 

Large: $|\mathrm{TDI}| > \text{test length} \times v_L$, 

where $v_M$ and $v_L$ are the desired minimum per-item threshold values for a TDI to qualify 
as medium or large. If we use a rather conservative approach of $v_M = 0.05$ (for medium) 
and $v_L = 0.10$ (for large), the thresholds are 0.85 (17 × 0.05) and 1.7 (17 × 0.1) for the Listening 
section and 1.2 (24 × 0.05) and 2.4 (24 × 0.1) for the Reading section. We see that the TDI 
classification is small for the Listening section and large for the Reading section. The use of 0.05 
and 0.10 is considered conservative because the IDIs must, on average, be at least medium or 
large for the TDI to be flagged as medium or large. This means that the rule may not capture the amplification 
effect arising from, say, many small positive IDIs whose effects accumulate into a 
relatively large TDI value. Both the classification of TDI and the development of the statistical 
significance testing described in this section warrant more research. 
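
The classification rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the TDI values passed in at the bottom are made up (they are not the values obtained in this study), and a value falling exactly on a cut point is assigned to the higher category because the rule leaves that case open.

```python
def tdi_thresholds(test_length, v_m=0.05, v_l=0.10):
    """Return the (medium, large) cut points: test length times v_M and v_L."""
    return test_length * v_m, test_length * v_l


def classify_tdi(tdi, test_length, v_m=0.05, v_l=0.10):
    """Label |TDI| as small or negligible, medium, or large.

    Assumption: a value exactly on a cut point goes to the higher
    category; the rule in the text does not settle the boundary cases.
    """
    medium_cut, large_cut = tdi_thresholds(test_length, v_m, v_l)
    magnitude = abs(tdi)  # the rule uses |TDI|, so direction is ignored
    if magnitude < medium_cut:
        return "small or negligible"
    if magnitude < large_cut:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Cut points for the two sections (17 Listening items, 24 Reading items).
    for section, n_items in (("Listening", 17), ("Reading", 24)):
        medium_cut, large_cut = tdi_thresholds(n_items)
        print(f"{section}: medium cut = {medium_cut:.2f}, large cut = {large_cut:.2f}")

    # Hypothetical TDI values, used only to exercise the function.
    print(classify_tdi(-0.5, 17))
    print(classify_tdi(3.0, 24))
```

Because the cut points scale with test length, the same per-item values $v_M$ and $v_L$ can be reused for tests of different lengths.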

We also want to mention that this study did not account for the potential impact of serial 
correlation arising from the use of the same stimulus. Item responses for the accuracy and confidence ratings 
were gathered under the same item stem, which may cause additional dependency between the 
accuracy and confidence dimensions in the data. Also, Reading and Listening items that share 
common stimuli (e.g., the same reading passage) may have extra data dependency as well, which 
is commonly known as a testlet effect. Modeling these potential clustering effects caused by the 
use of the same stimulus in a set of items can be explored in future studies. 
Notes 

1 The term bias is typically associated with statistical differential item functioning (DIF) techniques 
and used as a term for test fairness in the educational testing context. Readers should not 
confuse that usage with the bias scores discussed here. 

2 For the multidimensional Rasch model calibration, strict monotonicity in both $\theta_{ac}$ and $\theta_{co}$, 
together with normality, is assumed, but the minimal assumptions needed to generate item responses for 
confidence do not require specifying normality, the shape of the IRF, or strict monotonicity in $\theta_{ac}$. 

3 Gauss-Hermite quadrature is used to evaluate the integral in the marginal likelihood 
during the MML estimation. ConQuest provides empirical Bayes solutions in the Rasch 
calibration for the normal distribution mean and variance. In the actual estimation, only the 
variance of the normal distribution is estimated, while the mean is fixed at 0 because of the 
identifiability issue. The number of quadrature points can be user-specified (or default 
values can be used) in ConQuest. 
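
To make the role of the quadrature concrete, the following is a minimal sketch of how a marginal likelihood for one response pattern can be approximated under a dichotomous Rasch model with a normal ability distribution whose mean is fixed at 0. It is not ConQuest's implementation: the three-point rule is chosen only because its nodes and weights have closed forms, the function names are hypothetical, the standard deviation `sigma = 1.0` is an arbitrary illustrative value, and the two difficulties are those of Listening items 1 and 2 in Table A1.

```python
import math
from itertools import product

# Three-point Gauss-Hermite rule in "probabilist" form: the weighted sum of
# f(z) over these nodes equals the integral of f against the standard normal
# density exactly whenever f is a polynomial of degree 5 or lower.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def rasch_prob(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def marginal_likelihood(responses, difficulties, sigma):
    """Approximate the integral of the pattern likelihood over N(0, sigma^2)."""
    total = 0.0
    for z, w in zip(NODES, WEIGHTS):
        theta = sigma * z  # rescale the node; the mean is fixed at 0
        likelihood = 1.0
        for x, b in zip(responses, difficulties):
            p = rasch_prob(theta, b)
            likelihood *= p if x == 1 else 1.0 - p
        total += w * likelihood
    return total


if __name__ == "__main__":
    difficulties = [-1.129, 2.175]  # Listening items 1 and 2 (Table A1)
    sigma = 1.0                     # hypothetical value, for illustration only
    for pattern in product((0, 1), repeat=2):
        print(pattern, marginal_likelihood(pattern, difficulties, sigma))
```

In MML estimation, a quantity of this kind is computed for every examinee's response pattern and the product (or sum of logs) is maximized over the item parameters and the variance; more quadrature points give a closer approximation.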

4 In addition, the results could be very different with samples from different populations. 
Appendix A 

Item Difficulty Estimate and Item Fit Statistic 


Table A1 
Listening 

Item no.   Difficulty   SE      WMNSQ 
1          -1.129       0.086   1.05 
2           2.175       0.108   1.04 
3          -2.412       0.116   0.92 
4          -1.656       0.095   0.96 
5          -1.986       0.103   0.96 
6          -2.736       0.128   1.01 
7          -2.803       0.131   1.02 
8          -1.457       0.091   1.09 
9          -0.942       0.084   1.02 
10         -1.391       0.090   1.11 
11         -3.084       0.144   0.97 
12         -3.332       0.157   0.94 
13         -1.638       0.095   1.00 
14         -0.758       0.082   1.12 
15         -1.490       0.092   0.98 
16         -3.126       0.146   0.98 
17         -3.383       0.160   0.84 

Note. WMNSQ = weighted mean squared. 





Table A2 
Reading 

Item no.   Difficulty   SE      WMNSQ 
1          -2.484       0.116   1.01 
2          -3.718       0.176   0.98 
3          -1.222       0.088   0.98 
4          -2.258       0.109   0.88 
5          -2.653       0.121   0.94 
6          -0.704       0.083   0.97 
7          -3.191       0.144   1.04 
8          -0.467       0.082   1.04 
9          -1.803       0.098   0.89 
10         -1.891       0.100   0.84 
11         -1.316       0.089   1.07 
12         -2.907       0.131   0.94 
13         -1.993       0.102   0.87 
14         -0.718       0.083   0.98 
15         -1.025       0.086   1.02 
16         -2.697       0.123   0.99 
17         -1.584       0.094   0.91 
18         -0.269       0.081   1.05 
19         -2.056       0.103   1.03 
20         -0.663       0.083   0.90 
21         -3.299       0.150   0.90 
22         -2.580       0.119   0.94 
23         -1.160       0.087   1.00 
24          1.864       0.098   1.31 

Note. WMNSQ = weighted mean squared. 
Appendix B 

Local Item-Level Discrepancy Index, Local Test-Level Discrepancy Index, 
and Local Discrepancy Percentage 


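Local IDI (LIDI) = $\sum_{\theta_{ac} \in [\theta_S, \theta_L]} (P_i^* - P_i)\, w(\theta_{ac})$,   (B1) 

Local TDI (LTDI) = $\sum_{\theta_{ac} \in [\theta_S, \theta_L]} \sum_i [P_i^* - P_i]\, w(\theta_{ac})$, and   (B2) 

Local DP (LDP) = $\dfrac{\sum_{\theta_{ac} \in [\theta_S, \theta_L]} \sum_i [P_i^* - P_i]\, w(\theta_{ac})}{\text{Test Length}} \times 100$,   (B3) 

where $\theta_S$ and $\theta_L$ are the smallest and the largest boundaries of interest in $\theta$. 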
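
The three local indices can be computed from values tabulated on a grid of $\theta_{ac}$ points, as in the sketch below. Several things here are assumptions made for illustration: the function name, the layout of the inputs (one row per item, one column per grid point), the use of the weights exactly as supplied (without renormalizing them inside the window), and all numbers in the usage example, which are made up and do not come from this study.

```python
def local_indices(p_star, p, weights, thetas, theta_s, theta_l):
    """Return (LIDI per item, LTDI, LDP) over the window [theta_s, theta_l].

    p_star[i][k] and p[i][k] hold the two probabilities for item i at grid
    point thetas[k]; weights[k] is w(theta) at that grid point.
    """
    # Keep only the grid points that fall inside the window of interest.
    keep = [k for k, t in enumerate(thetas) if theta_s <= t <= theta_l]

    # (B1): weighted sum of the differences for each item.
    lidi = [
        sum((ps_i[k] - p_i[k]) * weights[k] for k in keep)
        for ps_i, p_i in zip(p_star, p)
    ]
    # (B2): sum of the item-level values over items.
    ltdi = sum(lidi)
    # (B3): LTDI as a percentage of the test length.
    ldp = ltdi / len(p) * 100.0
    return lidi, ltdi, ldp


if __name__ == "__main__":
    # Made-up two-item example on a three-point grid.
    thetas = [-1.0, 0.0, 1.0]
    weights = [0.25, 0.5, 0.25]
    p_star = [[0.6, 0.7, 0.8], [0.5, 0.6, 0.7]]
    p = [[0.5, 0.7, 0.9], [0.4, 0.5, 0.6]]
    print(local_indices(p_star, p, weights, thetas, 0.0, 1.0))
```

Setting the window to cover the whole grid gives the weighted sums over all grid points, so the local versions can be compared directly with their overall counterparts.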